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Abstract. The PC algorithm uses conditional independence tests for model 
selection in graphical modeling with acyclic directed graphs. In Gaussian mod- 
els, tests of conditional independence are typically based on Pearson correla- 
tions, and high-dimensional consistency results have been obtained for the PC 
algorithm in this setting. We prove that high-dimensional consistency carries 
over to the broader class of Gaussian copula or nonparanormal models when 
using rank-based measures of correlation. For graphs with bounded degree, 
our result is as strong as prior Gaussian results. In simulations, the 'Rank PC 
algorithm works as well as the 'Pearson PC algorithm for normal data and 
considerably better for non-normal Gaussian copula data, all the while incur- 
ring a negligible increase of computation time. Simulations with contaminated 
data show that rank correlations can also perform better than other robust es- 
timates considered in previous work when the underlying distribution does not 
belong to the nonparanormal family. 



1. Introduction 

Let G = {V, E) be an acyclic digraph with finite vertex set. We will typically 
write V ^ w ^ E io indicate that {v, w) is an edge in E. The digraph G determines 
a statistical model for the joint distribution of a random vector X = (X„)„gy 
by requiring that X satisfy conditional independence relations that are natural if 
the edges in E encode causal relationships among the random variables X^. We 
refer the reader to |Lau961 IPea091 ISC^SOOj or |DSS091 Chap. 3] for background on 
statistical modeling with directed graphs. As common in this field, we use the 
abbreviation DAG (for 'directed acyclic graph') to refer to acychc digraphs. 

The conditional independences associated with the graph G may be determined 
using the concept of d-separation. Since a DAG contains at most one edge between 
any two nodes, we may define a path from a node m to a node w to be a sequence 
of distinct nodes {vq, vi, . . . , Vn) such that vq — u, u„ — v and for all 1 < fc < n, 
either Vk-i — > f fe G £' or f^-i ^ Vk G E. Two distinct nodes u and v are then said 
to be d-separated by a set C \ {v, u} if every path from u to v contains three 
consecutive nodes {vk-i, Wfc, Wfe+i) for which one of the following is true: 

(i) The three nodes form a chain Vk-i — > ffc — > Vk+i, a chain Vk-i ^ Vk 
Vk+i, or a fork v^-i ^ Wfc — > Wfc+i, and the middle node Vk is in S. 

(ii) The three nodes form a collider Vk-i — >■ Wfe Vk+i, and neither Vk nor any 
of its descendants is in S. 

Suppose A, B, S are pairwise disjoint subsets of V. Then S d-separates A and B 
if S d-separates any pair of nodes a and b with a d A and b B. Finally, the 
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joint distribution of the random vector X = {Xy)y^v is Markov with respeet to 
a DAG G if Xj^ and Xb are conditionally independent given Xg for any triple 
of pairwise disjoint subsets A,B,S C V such that S d-separates A and B in G. 
Here, Xa denotes the subvector {Xy)y^A- It is customary to denote conditional 
independence of Xa and Xb given Xg by Xa AL Xb\ Xs- 

We will be concerned with the consistency of an algorithm for inferring a DAG 
from data. Graph inference is complicated by the fact that two DAGs G = {V,E) 
and H = (V, F) with the same vertex set V may be Markov equivalent, that is, they 
may possess the same d-separation relations and, consequently, induce the same 
statistical model. To give an example, the graphs m — ^ w — ^ w and m w u; 
are Markov equivalent, but u — > f — > w and u — > v w are not. As first shown 
in |VP91) . two DAGs G and H are Markov equivalent if and only if they have the 
same skeleton and the same unshielded colliders. The skeleton of a digraph G is 
the undirected graph obtained by converting each directed edge into an undirected 
edge. An unshielded collider is a triple of nodes (u, v, w) that induces the subgraph 
u ^ V w, that is, there is no edge between u and w. 

Let [G] be the Markov equivalence class of an acyclic digraph G = (V, E). Write 
E{H) for the edge set of a DAG H, and define the edge set 

[E]= U E{H). 
He[G\ 

That is, (d, w) G [E\ if there exists a DAG H G [G] with the edge v —i' w in its edge 
set. We interpret the presence of both {v,w) and (w, u) in [E] as an undirected 
edge between v and w. Following the most closely related literature, we call the 
graph C{G) = {V, [E]) the completed partially directed acyclic graph (CPDAG) for 
G, but other terminology such as the essential graph is in use. The graph C{G) is 
partially directed as it may contain both directed and undirected edges, and it is 
acyclic in the sense of its directed subgraph having no directed cycles. Two DAGs 
G and H satisfy G(G) = C[H) if and only if [G] = [H], making the C PDAG a 
useful graphical representation of a Markov equivalence class; see |AMP971 IChi02j . 

The PC algorithm, named for its inventors Peter Spirtes and Clark Glymour, 
uses conditional independence tests to infer a CPDAG from data SGSOO . In 
its population version, the algorithm amounts to a clever scheme to reconstruct 
the CPDAG G(G) from answers to queries about d-separation relations in the 
underlying DAG G. Theorem [T] summarizes the properties of the PC algorithm 
that are relevant for the present paper. For a proof of the theorem as well as a 
compact description of the PC algorithm we refer the reader to |KB07| . Recall that 
the degree of a node is the number of edges it is incident to, and that the degree of 
a DAG G is the maximum degree of any node, which we denote by deg(G). 

Theorem 1. Given only the ability to check d-separation relations in a DAG G, the 
PC algorithm finds the CPDAG G{G) by checking whether pairs of distinct nodes 
are d-separated by sets S of cardinality \S\ < deg(G). 

The joint distribution of a random vector X — {Xy)y^v is faithful to the DAG G 
if, for any triple of pairwise disjoint subsets A, B, S d V, we have that 5* d-separates 
A and B in G if and only if Xa -LL Xb \ Xs- Under faithfulness, statistical tests 
of conditional independence can be used to determine d-separation relations in a 
DAG and lead to a sample version of the PC algorithm that is applicable to data. 
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If X follows the multivariate normal distribution N{^^ S), with positive definite 
covariance matrix S, then 

(1.1) Xa^Xb\Xs ^ Xu^LX,\Xs Vw e A, e B. 

Moreover, the pairwise conditional independence of X^ and Xy given Xs is equiva- 
lent to the vanishing of the partial correlation Puv\Si that is, the correlation obtained 
from the bivariate normal conditional distribution of (X„,X„) given Xs- The iter- 
ations of the PC algorithm make use of the recursion 

(-.X Puv\S\w Puw\S\wPvw\S\w 

(.t-^j Puv\S ^ 



P'i,w\S\w) Plw\S\tu) 



where w € S, and we define Puv\il> — Puv to be correlation of u and v. Our later 
theoretical analysis will use the fact that 

(1-3) Puv\S ^ 



1 

vv 



where 4* = ^{u.v,S},iu,v,s) is the concerned principal submatrix of E. A natural 
estimate of Puv\s is the sample partial correlation obtained by replacing E with the 
empirical covariance matrix of available observations. Sample partial correlations 
derived from independent normal observations have favorable distributional prop- 
erties |And03[ Chap. 4], which form the basis for the work of jKB07| who treat the 
PC algorithm in the Gaussian context with conditional independence tests based on 
sample partial correlations. The main results in [KB07J show high-dimensional con- 
sistency of the PC algorithm, when the observations form a sample of independent 
normal random vectors that are faithful to a suitably sparse DAG. 

The purpose of this paper is to show that the PC algorithm has high-dimensional 
consistency properties for a broader class of distributions, when standard Pearson- 
type empirical correlations are replaced by rank-based measures of correlations 
in tests of conditional independence. The broader class we consider comprises 
the distributions with Gaussian copula. Phrased in the terminology of |LLW09j . 
we consider nonparanormal distributions. Recall that a correlation matrix is a 
covariance matrix with all diagonal entries equal to one. 

Definition 1. Let f = {fv)vev be a collection of strictly increasing, but not nec- 
essarily continuous functions /„ : M — > M, and let S G M^^^ be a positive definite 
correlation matrix. The nonparanormal distribution NPN{f, E) is the distribution 
of the random vector {fv{Zv))v£V for ~ A^(0,E). 

Taking the functions /„ to be affine shows that all multivariate normal distri- 
butions are also nonparanormal. li X ^ NPN{f,Y^), then the univariate marginal 
distribution for a coordinate, say Xy, may have any continuous cumulative distri- 
bution function F, as we may take /„ = o $, where $ is the standard normal 
distribution function. 

Definition 2. The Gaussian copula graphical model NPN{G) associated with a 
DAG G is the set of all distributions NPN(f, E) that are Markov with respect to G. 

Since the marginal transformations /„ are deterministic, the dependence struc- 
ture in a nonparanormal distribution corresponds to that in the underlying la- 
tent multivariate normal distribution. In other words, ii X ^ NPN{f, S) and 
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Z ^ N{0, E), then it holds for any triple of pairwise disjoint sets A, B, S C V that 

(1.4) XaALXb\Xs ^ ZaALZb\Zs. 

Hence, for two nodes u and v and a separating set C F \ {m, v}, it holds that 

(1.5) X^ALX,\Xs ^ Puv\s = 0, 



with Puv\s calculated from E as in ( 1.2 ) or ( 1.3 1. In light of this equivalence, we will 
occasionally speak of a correlation matrix U being Markov or faithful to a DAG, 
meaning that the requirement holds for any distribution NPN{f, E). 

In the remainder of the paper we study the PC algorithm in the nonparanor- 
mal context, proposing the use of Spearman's rank correlation and Kendall's t for 
estimation of the correlation matrix parameter of a nonparanormal distribution. 
In Section [2] we review how transformations of Spearman's rank correlation and 
Kendall's t yield accurate estimators of the latent Gaussian correlations. In partic- 



ular, we summarize tail bounds from LHY"'"12 . Theorem [2] in Section [s] gives our 



main result, an error bound for the output of the PC algorithm when correlations 
are used to determine nonparanormal conditional independence. In Corollary[T| we 
describe high-dimensional asymptotic scenarios and suitable conditions that lead 
to consistency of the PC algorithm. The proof of Theorem [2] is given in Section |4] 
Our simulations in Section [5] make a strong case for the use of rank correlations in 
the PC algorithm. Some concluding remarks are given in Section [6] 

2. Rank correlations 

Let {X, Y) be a pair of random variables, and let F and G be the cumulative 
distribution functions of X and Y, respectively. Spearman's p for the bivariate 
distribution of {X, Y) is defined as 

(2.1) p^ = Corr(F(X),G(r)), 

that is, it is the ordinary Pearson correlation between the quantiles F{X) and G{Y). 
Another classical measure of correlation is Kendall's t, defined as 

(2.2) T = Corr (sign {X - X') , sign {Y - Y')) 

where {X' ,Y') is an independent copy of {X,Y). 

Suppose {Xi,Yi), . . . (X„,y„) are independent pairs of random variables, each 
pair distributed as {X,Y). Let rank(Ari) be the rank of Xi among Xi, . . . ,X„. In 
the nonparanormal setting, the marginal distributions are continuous so that ties 
occur with probability zero, making ranks well-defined. The natural estimator of 

is the sample correlation among ranks, that is, 

'rank(yi) 1^ 



1 / rank(Xi) 1 ^ Z' ' 

(2.3) p^ = ^ ^ ^ 



2 



1 / rank(JCi) \\ / 1 / r 



rank(yi) 1 ' ^ 



2 



(2.4) = 1 - . , ^. (rank(X,) - rank(r,))' , 

n [n^ — 1 — ' 

^ ' 1=1 

which can be computed in 0{nlogn) time. Kendall's r may be estimated by 

(2.5) T= \ sign (^. -^,) sign (r,-r,). 

n[n— 1 

l<i<j<n 
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A clever algorithm using sorting and binary trees to compute f in time O(nlogn) 
instead of the naive O(n^) time has been developed by |Chr05j . 

It turns out that simple trigonometric transformations of and f are excellent 
estimators of the population Pearson correlation for multivariate normal data. In 
particular, |LIIY+12] show that if (X, Y) are bivariate normal with Corr(X, Y) = p, 
then 



(2.6) 

and 

(2.7) 



(|2sin(^ 



P 



> e] < 2 exp 



97r2 



sm I -r) - p 



> e 



< 2 cxp -ne" 



. . (X„, Yn) only through 



Clearly, and depend on the observations {Xi , Yi ) 
their ranks. Since ranks are preserved under strictly increasing functions, (2.6) and 
( [2J1 ) stih hold if {X,Y) ~ NPN{f,E) with Pearson correlation p = S^jy in the 
underlying latent bivariate normal distribution. Throughout the rest of this paper, 
we will assume that we have some estimator p of p which has the property that, 
for nonparanormal data. 



(2.8) 



|p-p|>e) < 74exp(-Bne^ 



for fixed constants < A, _B < oo. As just argued, the estimators considered in 



(2.6) and (2.7) both have this property. 



When presented with multivariate observations from a distribution iVPiV(/, S), 

to every pair of coordinates to obtain an estimator 



we apply the estimator from ( 2 



S of the correlation matrix parameter. Plugging E into (1.2) or equivalently into 



(1.3) gives partial correlation estimators that we denote p. 



'uv\S- 



3. Rank PC algorithm 



Based on the equivalence (1.5), we may use the rank-based partial correlation 
estimates Puv\s to test conditional independences. In other words, we conclude that 



(3.1) 



XuALXy\Xs \Puv\s\<li 

We will refer to the PC algorithm that uses 



where 7 G [0, 1] is a fixed threshold, 
the conditional independence tests from (3.1) as the 'Rank PC (RPC) algorithm 



We write C^{G) for the output of the RPC algorithm with tuning parameter 7. 

The RPC algorithm consist of two parts. The first part computes the correlation 
matrix S = {puv) in time 0{p^n\ogn), where p := \V\. This computation takes 
0(log n) longer than its analogue under use of Pearson correlations. The second part 
of the algorithm is independent of the type of correlations involved. It determines 
partial correlations and performs graphical operations. For an accurate enough 
estimate of a correlation matrix E that is faithful to a DAG G, this second part 
takes 0(p'^°s(G)) time in the worst case, but it is often much faster; compare |KB07j . 
For high-dimensional data with n smaller than p, the computation time for RPC is 
dominated by the second part, the PC-algorithm component. Moreover, in practice, 
one may wish to apply RPC for several different values of 7, in which case the 
estimate E needs to be calculated only once. As a result. Rank PC takes only 
marginally longer to compute than Pearson PC for high-dimensional data. 
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What foUows is our main result about the correctness of RPC, which we prove 
in Section [4] For a correlation matrix E G M^^^, let 

(3.2) Cmin(S) := min{|p„„|s| : u, w G V", C F \ {u, w}, 7^ O} 

be the minimal magnitude of any non-zero partial correlation, and let Aniin(S) be 
the minimal eigenvalue. Then for any integer q > 2, let 

(3.3) Cmin(S,g) := min{cmiii(I]/^/) : / C V', |/| = g} , and 

(3.4) Xmin{^,q) ■■= min{A„i„(I]/,j) : ICV,\I\=q} 

be the minimal magnitude of a non-zero partial correlation and, respectively, the 
minimal eigenvalue of any q x q principal submatrix of S. Note that if / C J then 

Cmini^Ij) < Cmin(Sj,j) and A,„in(S/j) < A,„in(Sj,j). 

Theorem 2 (Error bound for RPC-algorithm) . Let Xi, . . . , he a sample of 
independent observations drawn from a nonparanormal distribution NPN{f, E) that 
is faithful to a DAG G with p nodes. For q := deg(G') -I- 2, let c :~ Cnii„(E, q) and 
A :— Ainin(S,q). If n > q, then there exists a threshold 7 G [0, 1] for which 

P{C,{G) ^ C{G)) < 4p'exp 



2 " "V 36g- 



where < A, B < 00 are the constants from (2.8). 



We remark that while all subsets of size q appear in the definitions in (3.3 1 and 

(3.4) , our proof of Theorem [2] only requires the corresponding minima over those 
principal submatrices that are actually inverted in the run of the PC-algorithm. 

From the probability bound in Theorem [2] we may deduce high-dimensional 
consistency of RPC. For two positive sequences (s„) and (in), we write s„ = 0(t„) 
if Sn < Mt„, and s„ = ri(tn) if > Mtn for a constant < M < 00. 

Corollary 1 (Consistency of RPC-algorithm). Let {Gn) be a sequence of DAGs. 
Let pn be the number of nodes of Gn, and let g„ = deg(G„) -1-2. Suppose (S„) is a 
sequence of Pn x p„ correlation matrices, with S„ faithful to G„ . Suppose further 
that there are constants < a,b,d, f < 1 that govern the growth of the graphs as 

logp„ = Oin'^), qn = Gin"), 

and minimal .signal strengths and eigenvalues as 

Cmin(S„, qn) = ^{n^'^), A,nin(S„, g„) = iliu"^). 

If a + 2b + 2d + Af < 1, then there exists a sequence of thresholds 7„ for which 

lim P(G^„(G„) = G(G„)) - 1, 

n— >-oo 

where G-y,^(G„) is the output of the RPC algorithm for a sample of independent 
observations Xi, . . . , X„ from a nonparanormal distribution NPN{ ■ , S„). 

Proof. By Theorem [2] for large enough n, we can pick a threshold 7„ such that 

(3.5) FiC^^ {Gn) ^ G(G„) < A' exp (2n" - S'ni-26-2d-4/^ 

for constants < A' ,B' < 00. The bound goes to zero if 1 — 26 — 2(i — 4/ > a. □ 
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As previously mentioned, |KB07j prove a similar consistency result in the Gauss- 
ian case. Whereas our proof consists of propagation of errors from correlation to 
partial correlation estimates, their proof appeals to Fisher's result that under Gaus- 
sianity, sample partial correlations follow the same type of distribution as sample 
correlations when the sample size is adjusted by subtracting the cardinality of the 
conditioning set [And03| Chap. 4]. It is then natural to work with a bound on 
the partial correlations associated with small conditioning sets. More precisely, 
|KB07| assume that there is a constant < i\f < 1 such that for any n, the partial 
correlations Puv\s of the matrix S„ satisfy 

(3.6) \Puv\s\<M yu,veV, SCV\{u,v},\S\<qn. 



It is then no longer necessary to involve the minimal eigenvalues from (3.4). The 
work in |KB07j is thus free of an analogue to our constant /. Stated for the case of 
polynomial growth oipn (with a = 0), their result gives consistency when b+2d < 1, 
whereas our condition requires 2b + 2d < 1 even if / = 0. (Note that our constant 
b corresponds to 1 — 6 in jKB07 !.) 

In the important special case of bounded degree, however, our nonparanormal 
result is just as strong as the previously established Gaussian consistency guarantee. 
Staying with polynomial growth of p„, i.e., a — 0, suppose the sequence of graph 
degrees deg(G„) is indeed bounded by a fixed constant, say qo — 2. Then clearly, 



b = 0. Moreover, the set of correlation matrices of size Qq satisfying (3.6) with 
Qn = Qo is compact. Since the smallest eigenvalue is a continuous function, the 
infimum of all eigenvalues of such matrices is achieved for some invertible matrix. 
Hence, the smallest eigenvalue is bounded away from zero, and we conclude that 
/ = 0. Corollary [ijthus implies consistency if 2cZ < 1, or if d < ^ = precisely as 
in |KB07j . (No generality is lost by assuming a = 0; in either one of the compared 
results this constant is involved solely in a union bound over order events.) 



4. Proof of the error bound 

In this section, we prove the error bound in Theorem [2j Our argument starts 
from a uniform bound on the error in our estimate E. Then we analyze how this 
error propagates to the partial correlation estimates Puv\Sj giving again a uniform 
error bound. We begin by proving three lemmas about the error propagation. 

The first lemma invokes classical results on error propagation in matrix inversion. 
Let \\A\\ denote the spectral norm of a matrix A = (ajj) = K*^'', that is, is 
the maximal eigenvalue of A^A. Write the loo vector norm of A as 

||A||oo = rnax \aij\. 

Lemma 1 (Errors in matrix inversion). Suppose E G K"?^' is an invertible matrix 
with minimal eigenvalue Amin- If E d M'x? ^ matrix of errors with ||-E||oo < e < 
Amin/'Z; then T, + E is invertible and 

Proof. First, note that 

(4.1) lli?||oo<||i?ll<<7||i?||oo; 
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see entries (2,6) and (6,2) in the table on p. 314 in jHJ90| . Using the submulti- 
plicativity of a matrix norm, the second inequaUty in (4.1 ), and our assumption on 
e, we find that 

(4.2) \\ej:'^<\\j:~^-\\e\\<^<i. 

As discussed in |HJ90[ Sect. 5.8], this implies that / + EJ^^^ and thus also Y, + E 
is invertible. Moreover, by the first inequality in (4.11 and inequality (5.8.2) in 
|H J90| ■ we obtain that 

(4.3) ||(E + E)-' - E-^llo, < ||(S + E)-' E-^ll < • /^^^'j^^^ . 

Since the function x i— x/{l — x) is increasing for a; < 1, our claim follows from the 
fact that IIS^-^II ~ l/Xmm and the inequahty < qe/Amin from (4.2). □ 

Lemma 2 (Diagonal of inverted correlation matrix). // E e M"?^"? is a positive 
definite correlation matrix, then the diagonal entries ofT,^^ — {cr^-') satisfy a" > 1. 

Proof. The claim is trivial for q = 1. So assume q > 2. By symmetry, it suffices to 
consider the entry a''', and we partition the matrix as 

(4.4) ^=[^ \ 

with A e m(9-i)x(«-i) and b e R'^^K By the Schur complement formula for the 
inverse of a partitioned matrix, 

rrll = I • 



compare |HJ90[ §0.7.3]. Since A is positive definite, so is A^^. Hence, b^A~^b > 0. 
Since is positive definite, cr'^'? cannot be negative, and so we deduce that aqq > 
1, with equality if and only if & = 0. □ 



The next lemma treats the error propagation from the inverse of a correlation 
matrix to partial correlations. 

Lemma 3 (Error in partial correlations). Let A — {aij) and B ~ {bij) be symmetric 
2x2 matrices. If A is positive definite with an, 022 > 1 \\A — i?||oo < <5 < 1, 
then 

ai2 bi2 

\/aiia22 \/biib22 

Proof. Without loss of generality, suppose 012 > 0. Since \\A — B\\ao < S, 

bi2 ai2 ^ ai2 + (5 ai2 

\/biib22 \/0'\\0'22 (ail ~ <^) (122 ^ -\/aiia22 



v/(aii - 5) [022 - 5) \ V(aii - 5) (022 - 5) V"ii"22 



< 



25 
\-6' 
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Using that an, 022 > 1 to bound the first term and 0^2 < aiia22 to bound the 
second term, we obtain that 

^'12 0,12 S I 1 1 \ 

< J + V«iia22 



V&11&22 ^/alla22 ^-6 " I y(ai7^^^Ha22^-^ yaiia22 



(> , I I On a22 _ 



1 — 6 \Y Oil — S 022 — S 

Since the function x t-^ x/{x — 6) is decreasing, we may use our assumption that 
Oil, 022 > 1 to get the bound 



6,2 a,2 ,M = 2<5 



\/^ii622 V"iia22 \\jl-5 1-S I 1-5 

A similar argument yields that 

(4.5) - <r 



from which our claim follows. □ 

We are now ready to prove our main result. 

Proof of Theorem We will show that our claimed probability bound for the event 
Cj{G) ^ C{G) holds when the threshold in the RPC algorithm is 7 = c/2. By 
Theorem [1] if all conditional independence tests for conditioning sets of size \S\ < 
deg(G) make correct decisions, then the output of the RPC algorithm C^{G) is 
equal to the CPDAG C{G). When 7 — c/2, the conditional independence test 
accepts a hypothesis _LL Xy\Xs if and only if |/5„t,|s| < 7 = c/2. Hence, the 
test makes a correct decision if |/5„^,|5 — Puv\s\ < c/2 because all non-zero partial 



correlations for \S\ < deg(G) are bounded away from zero by c; recall ( |3.2[ ) and (3.3 1 
It remains to show, using the error analysis from Lemmas [T] and [3j that the event 
\Puv\S ~ Puv\s\ ^ c/2 occurs with small enough probability when \S\ < deg(G). 
Suppose our correlation matrix estimate E = (fuv) satisfies ||E — E||oo < e for 

(4.6) e - , > 0. 

(4 + c)g + Xcq 

Choose any two nodes u,v E V and a set S C V \ {u, v} with \S\ < deg(G) = q — 2. 
Let / = {u,v} U S and define ^' = ^1,1 and 5* = E/j, two principal submatrices 
of size at most q. For the choice of e from (4.6), the assumptions of Lemma[l]hold 
and we deduce that ^ is invertible, with 

(4.7) ||*-^-*-^|| < 



1 - qe/X (4 + c)q + Xcq -Xcq 4 + c 

By Lemma[2j all diagonal entries of ^t^^ are equal to one or greater, and so we can 
apply Lemma[3]with (4.7 1. Letting 6 = c/{A + c), we get that 



\Puv\S - Puv\s\ 



uv uv 



2S _ c 
1 - S ~ 2' 



Therefore, |jE — S||oo < e implies that our tests decide all conditional independences 
correctly in the RPC algorithm. 
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Next, using ( |2.8[ ) and a union bound, we find that 

'C'^(G) ^ C(G) \ < P (|S„„ - S„„| > e for some u,v€V 



<Afc^exp(-W). 
Plugging in the definition of e gives the claimed inequality 

A ^ ( BX^nc^ \ A 



nc 



. :C,(G) ^ G(G)] < ^p^exp ~- ^ — ^ < -p^exp ^ 

because c < 1 and A < 1. The inequality c < 1 holds trivially because partial 
correlations are in [—1, 1]. The inequality A < 1 holds because a q x q correlation 
matrix has trace q, this trace is equal to the sum of the q eigenvalues, and A is the 
minimal eigenvalue. □ 

5. Simulations 

In this section we evaluate the finite-sample performance of the RPC algorithm 
in simulations. We compare RPC to two other versions of the PC-algorithm: (i) 
'Pearson-PC, by which we mean the standard approach of using sample partial 
correlations to test Gaussian conditional independences, and (ii) '(3„-PC', which is 
based on a robust estimator of the covariance matrix and was considered in |KB08| . 
All our computations are done with the pcalg package for R [KMC^ 12) . 

The Gaussian conditional independence tests in the pcalg package (and other 
software such as Tetrad use a level a S [0, 1] and decide that 



(5.1) X^ALX,\Xs ^ ^n-\S\-3 



^logfil^ 

2 \^ ~ Puv\S 



< ^-'{l~a/2). 



If the observations are multivariate normal and Puv\s ^'^e sample partial correlations 



then a is an asymptotic significance level for the test. The test in (5.1 1 is equivalent 



to our earlier setup of conditional independence tests in (3.1), with the exception 
of the sample size adjustment from n to n — \S\ — 3. This adjustment is motivated 
by classical large-sample bias-correction theory for Fisher's z-transform of sample 
correlations; compare |And03| . We show in the appendix that the adjustment has 
no affect on the consistency result we proved in Corollary [T] 

Following the setup of |KB07| . we simulate random DAGs and sample from 
probability distributions faithful to them. Fix a sparsity parameter s € [0, 1] and 
enumerate the vertices as V = {1, . . . Then we generate a DAG by including 
the edge u ^ v with probability s, independently for each pair (u, v) with 1 < u < 
V < p. In this scheme, each node has the same expected degree, namely, {p — l)s. 

Given a DAG G = {V,E), let A = (Xuv) he a p x p matrix with X^v = if 
u ^ V ^ E. Furthermore, let e = (ei, . . . , e^) be a vector of independent random 
variables. Then the random vector X solving the equation system 

(5.2) X = KX + e 

is well-known to be Markov with respect to G. Here, we draw the edge coefficients 
Xuvi u ^ V (li E, independently from a uniform distribution on the interval (0.1, 1). 



For such a random choice, with probability one, the vector X solving (5.2 1 is faithful 
with respect to G. We consider three different types of data: 

^http://www. phil.cmu.edu/projects/tetrad 
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(i) multivariate normal observations, which we generate by taking e in (5.2) to 
have independent standard normal entries; 

(ii) observations from the Gaussian copula model, for which we transform the 
marginals of the normal random vectors from (i) to an i^i^i-distribution; 



(iii) contaminated data, for which we generate the entries of e in (5.2) as in- 
dependent draws from a 80-20 mixture between a standard normal and a 
standard Cauchy distribution. 

The contaminated distributions in (iii) do not belong to the nonparanormal class. 
We consider the RPC algorithm in the version that uses Spearman correlations 



as in ( 2.6 ); the results for Kendall's t are similar. When comparing graph estimates, 
we use the Structural Hamming Distance (SHD) as a measure of distance. The SHD 
is the number of edge insertions, deletions, and reorientations required to transform 
one graph to another. An undirected edge is counted as a single edge. 
For the simulations we consider each combination of 

pe {10,22,46, 100} and n e {32, 100, 316, 1000, 3162}, 

and choose the expected degree as either 3 or 6. In each case, we draw 240 random 
graphs and then generate samples of n observations. For the tuning parameter in 



(5.1), we consider a fixed grid, namely, 

logio a e {-7, -6, -5, -4.25, -3.5, -2.75, -2, -1.5, -1, -0.75}. 

For each of the resulting combinations, we run each of the three considered versions 
of the PC algorithm, retaining the result for the best choice among 7 values for 
a, best in terms of lowest average SHD to the true underlying DAG for a given 
combination. In Figures [T] [2] and [3) we plot the resulting SHDs against the sample 
size n. 

A clear message emerges from the plots. First, Figure [T] shows that for normal 
data, RPC performs only marginally worse than Pearson-PC. The (5„-PC algorithm 
also does well, although some gap in SHD arises for small sample sizes. Second, 
Figure [2] shows a dramatic gain in performance for RPC for the Gaussian copula 
data with Fi.i marginals. In fact, the SHD associated with the other two graph 
estimators is comparable to that of estimating the graph to always be empty. The 
expected SHD between the empty graph and a graph on p nodes with expected 
degree d is simply the expected number of edges in our random graphs, which is 
pd/2. For our choices of c? = 3 and d = 6, the respective expected SHD is 150 and 
300 when p — 100. Finally, Figure [3] shows that RPC outperforms Qn-PC for the 
contaminated data; Q^-PC outperforms Pearson-PC for larger choices of p. 

6. Conclusion 

The PC algorithm of |SGSOO) addresses the problem of model selection in graph- 
ical modelling with directed graphs via a clever scheme of testing conditional inde- 
pendences. For multivariate normal observations, the algorithm is known to have 
high-dimensional consistency properties when conditional independence is tested 
using sample partial correlations |KB07) . We show that the PC algorithm retains 
these consistency properties when observations follow a Gaussian copula model and 
rank-based measures of correlation are used to assess conditional independence. The 
assumptions needed in our analysis are no stronger than those in prior Gaussian 
work when the considered sequence of DAGs has bounded degree. When the degree 
grows our assumptions are slightly more restrictive as our proof requires control of 
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-| 1 1 1 r-' '-1 1 1 1 r- 

1.5 2.0 2.5 3.0 3.5 1.5 2.0 2.5 3.0 3.5 

loglO(n) loglO(n) 



p = 46 p = 100 




1.5 2.0 2.5 3.0 3.5 1.5 2.0 2.5 3.0 3.5 

Iog10(n) loglO(n) 



Figure 1. Structural Hamming distances for normal data, 

graphs with expected degree 3 (sohd lines) and 6 (dotted lines), 
and three versions of the PC algorithm: Pearson-PC (thin lines), 
Qn-PC (medium lines) and RPC using Spearman's p (thick lines). 



the conditioning of principal submatrices of correlation matrices that are inverted 
to estimate partial correlations in the rank-based PC (RPC) algorithm. 

Our simulations show that for normal data the RPC algorithm does essentially 
as well as the sample correlation-based version of the algorithm. As can be ex- 
pected, we see RPC retain this performance for Gaussian copula data, for which 
sample correlations arc poorly suited. Somewhat surprisingly, RPC also performed 
better than a previously considered robust version of the PC algorithm under a 
contamination model. We remark that the consistency theory available for this ro- 
bust version is for a fixed graph size p. Since rank correlations take only marginally 
longer to compute than sample correlations, hardly any downsides are associated 
with making RPC the standard version of the PC algorithm for continuous data. 

In our work on consistency, the data-generating distribution is assumed to be 
faithful to an underlying DAG. In fact, our results make the stronger assump- 
tion that non-zero partial correlations are sufficiently far from zero. As shown in 
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Iog10(n) 



Figure 2. Structural Hamming distances for Gaussian copula 
data with Fi^i marginals, graphs with expected degree 3 (solid 
lines) and 6 (dotted lines), and three versions of the PC algorithm: 
Pearson-PC (thin lines), Q„-PC (medium lines) and RPC using 
Spearman's p (thick lines). 



[URYBj ■ this can be a restrictive assumption, which provides an explanation for 
why consistency does not 'kick-in' quicker in our simulation study. 

Finally, we remark that extensions of the PC algorithm exist to deal with situa- 
tions in which some causally relevant variables remain unobserved. Such algorithms 
infer a more complex graphical object; compare [SGSOO] and |CMKRl"2] . It is rea- 
sonable to expect the use of rank correlations to be beneficial in those settings as 
well, and a study of these algorithms would be an interesting topic for future work. 
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1.5 2.0 2.5 3.0 3.5 1.5 2.0 2.5 3.0 3.5 
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Figure 3. Structural Hamming distances for contaminated data, 
graphs with expected degree 3 (sohd Hues) and 6 (dotted hnes), 
and three versions of the PC algorithm: Pearson-PC (thin lines), 
Qn-PG (medium lines) and RPC using Spearman's p (thick lines). 



Appendix A. Sample size adjustment 

We now show that the consistency result in Corollary [T] still holds when using 
the conditional independence tests from (5.1 1. In these tests, the sample size is 
adjusted from n to n — \S\ — 3. 



Proof. The test in (5.1) accepts a conditional independence hypothesis if and only 
if 



(A.l) 
where 



\Puv\s\ < 7('^J5'|,z), 



(A.2) 



7(n, |S'|,z) = 



exp {z/y/n- \S\ -3) - 1 
exp {z/y/n^\S\-3) + 1 
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and z = z{a) = 2$^^(1 — a/2). We need to find a sequence (a„) of values for a 
such that consistency holds under the scaling assumptions made in Corollary [T] We 
will do this by specifying a sequence (z„) for values for the (doubled) quantiles z. 

We claim that the RPC algorithm using the tests from ( |A.l ) is consistent when 
choosing the quantile sequence 

1 + c„/3 
1 - c„/3 



(A.3) Zn = - 3 • log 

where we use the abbreviation 

We will show that as the sample size n tends to infinity, with probability tending 
to one, \puv\s ~ Puv\s\ < c„/3 for every u,v ^ V and 15*1 < q„. Furthermore, 
we will show that for the above choice of z„ and all sufficiently large n, we have 
Cn/3 < 7(n, |5'|,z„) < 2c„/3 for each relevant set S with < jS'l < q„. These two 
facts imply that, with asymptotic probability one, every conditional independence 
test is correct, and the RPC algorithm succeeds. 

First, we slightly adapt the proof of Theorem [2] Choosing the uniform error 
threshold for the correlation estimates as 

(o + c)q + Acq 



in place of (4.6) yields that, with probability at least 
(A.5) l--p2exp - 



2 " "V 64g2 

we have that \puv\s~Puv\s\ < c/3 for every u,v gV and jS'l < q. When substituting 
and Amin(S„,<7„) for p, q, c and A, respec tively , the scaling assumptions 
in Corollary [l] imply that the probability bound in ( |A.5[ ) tends to one as n — oo, 
and we obtain the first part of our claim. 

For the second part of our claim, note that our choice of Zn in (A.3) gives 
7(n, 0, z„) = c„/3. Since j{n,\S\,z) is monotonically increasing in jS'l, we need 
only show that for sufficiently large n, 

7(n, qn, Zn) — 7(71, 0, 

For a: > 0, the function 

/(.) = ^^^^^ 

exp(a::) + 1 

is concave and, thus, for any qn > 0, 

7(n, qn, Zn) - j{n, 0, z„) = / ( ^ ^ ) " -M r^-^ 



(A.6) < /' 

The derivative of / is 



z \ / z z 

\/n-'i) \ \/n - 9„ - 3 - 3 

2 exp(2;) 
(exp(a;) + 1)^ 
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Evaluating the right hand side of (A.6), we obtain that 



.(„,,..,„, -0<n.O,,,.) < i (l - I) log (i±^) (-^5^2^ - 1 
(A^7) <il„g''l+'="/='^ 



2 \l-Cn/i) \Vn - qn - 5 

Being derived from absolute values of partial correlations, the sequence c„ is in 
[0, 1]. Now, log[(l + x)/{l ~ x)] is a convex function of a; > that is zero at a; = 
and equal to log(2) for x = 1/3. Therefore, 



This shows that the bound in (A. 7 1 is o(c„) because, by assumption, qn ~ o{y/n). 



In particular, the bound in (A. 7 1 is less than c„/3 for sufficiently large n, proving 



the claimed consistency result. □ 
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